Introduce timeout for hanging workspaces #605
Add setting `workspace.startProgressTimeout` to denote maximum duration for any workspace phase before the workspace start is failed. Default value is 5m.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
// a "Starting" phase without progressing before it is automatically failed. | ||
// Duration should be specified in a format parseable by Go's time package, e.g. | ||
// "15m", "20s", "1h30m", etc. If not specified, the default value of "5m" is used. | ||
StartProgressTimeout string `json:"startProgressTimeout,omitempty"` |
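For illustration, here is a minimal sketch of how such a duration string could be resolved against the default, assuming a hypothetical `resolveProgressTimeout` helper (this is not the controller's actual code):

```go
package config

import "time"

// defaultProgressTimeout mirrors the documented default of "5m".
const defaultProgressTimeout = 5 * time.Minute

// resolveProgressTimeout is a hypothetical helper illustrating the documented
// behavior: parse the configured string with Go's time package, falling back
// to the default when the field is unset.
func resolveProgressTimeout(configured string) (time.Duration, error) {
	if configured == "" {
		return defaultProgressTimeout, nil
	}
	// time.ParseDuration accepts values like "15m", "20s", "1h30m".
	return time.ParseDuration(configured)
}
```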
I wonder if this might make more sense to be something like `StartPhaseTimeout`? WDYT?
I renamed it to `ProgressTimeout` to also cover failing DevWorkspaces, and forgot this comment -- sorry. Are you okay with that name? My thinking is that it's the max duration a DevWorkspace can exist without progressing.
Just tested and everything worked as expected
LGTM. I just have some questions, which I may answer myself during testing, but I'm not there yet.
```diff
@@ -355,6 +376,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
 	}
 	reqLogger.Info("Waiting on deployment to be ready")
 	reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for workspace deployment")
+	if !deploymentStatus.Requeue && deploymentStatus.Err == nil {
+		return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
```
Does the last or the first interval win here? I mean:

- We wait for the deployment initially and schedule reconciling in 5 secs (1);
- The deployment creates pods, which invokes the reconcile loop; we schedule again in 5 secs (2);
- The pod is updated several times as container statuses are propagated; we schedule N times in 5 secs (N).

The question is: how many reconcile loops will our `RequeueAfter` initiate?
My understanding is that reconciling due to an event cancels out a pending `RequeueAfter`, but it's hard to check for sure. In my testing, I don't see bursts of "waiting on deployment" in the logs, which is what would likely happen if we were stacking requeues.
```diff
@@ -59,6 +59,10 @@ import (
 	dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
 )
 
+const (
+	startingWorkspaceRequeueInterval = 5 * time.Second
```
I wonder if 5 secs is always the best choice, or whether we could make it dynamic, e.g. 1, 2, or 3 minutes in the case of a 5-minute timeout?
I mostly picked 5 seconds because it feels long enough that it's not burdening the controller (we can reconcile many times a second) but also short enough to avoid strange issues in setting a timeout (e.g. if we set a 1 minute requeue, what happens if the config specifies a timeout of 1 minute 15 seconds?)
> 1 minute 15 seconds

I don't think there is a case for such a precise timeout. Maybe we can declare our precision to be something like 1 minute or 30 seconds.

And actually, I thought about reconciling after (timeout + last transition time - now + 5 sec), which would initiate a reconcile loop 5 seconds after the point where we potentially need to kill the workspace. But I'm OK with any approach, as long as it does not generate redundant load.
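To make that suggestion concrete, here is a rough sketch of the formula (names are illustrative, not from this PR):

```go
package controllers

import "time"

// dynamicRequeueAfter sketches the suggested dynamic requeue: rather than a
// fixed 5s poll, schedule the next reconcile for just after the moment the
// timeout would expire (timeout + lastTransitionTime - now + 5s buffer).
func dynamicRequeueAfter(timeout time.Duration, lastTransition time.Time) time.Duration {
	remaining := timeout - time.Since(lastTransition) + 5*time.Second
	if remaining < 5*time.Second {
		// The timeout has (nearly) expired; reconcile soon so the
		// workspace can be failed promptly.
		remaining = 5 * time.Second
	}
	return remaining
}
```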
I tested this informally by starting a workspace that will time out (ignore `FailedScheduling` events and ask for 100Gi in Theia). While the workspace was looping on checking the deployment every 5 seconds, I ran

```bash
for i in {1..5}; do
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "false"}}}'
  sleep 0.5
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "true"}}}'
  sleep 0.5
done
```

to trigger 10 reconciles on the object, with the assumption that each of those would also start a `RequeueAfter`. However, after the 10 quick reconciles completed, the controller went back to queuing reconciles every five seconds, rather than 10 reconciles every 5 seconds, so it seems like a pending `RequeueAfter` is cancelled if an event triggers a reconcile.
I agree that 5 seconds may be the wrong value here; we should tweak this later.
Rename startProgressTimeout to progressTimeout and repurpose it to detect workspaces stuck in the "Failing" state for too long as well.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
I don't have any reason to delay delivering this valuable feature.
Approving; if I find any way to improve or optimize the implementation (if it's really needed), I'll raise a dedicated issue.
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: amisevsk, JPinkney, sleshchenko

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
/test v8-devworkspace-operator-e2e, v8-che-happy-path
@amisevsk: The following test failed, say `/retest` to rerun all failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I assume it's caused by a health check that treats a redirect as success.
Hi team, I have installed eclipse-che in a k8s cluster and I'm facing the below timeout error. I tried increasing the timeout in ...
Hi @Divine1 -- that error will occur when the DevWorkspace does not reach a "Running" state before the start timeout (introduced in this PR). By default this is 5 minutes, but it can be increased by setting the `progressTimeout` field in the DevWorkspaceOperatorConfig:

```yaml
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: <devworkspace/che installation namespace>
config:
  workspace:
    progressTimeout: 15m # This sets the timeout to 15 minutes
```

However, if it's taking longer than 5 minutes for your workspace to start, that suggests a different problem. Do you have any information about the workspace or its pods while it's starting? Some things to check: the workspace pod's events and container statuses while the workspace is stuck.

From here, we can start to narrow down what's causing the issue. Feel free to open another issue in this repo and we can discuss there.
@tolusha Thank you for checking this. I have posted detailed logs at the link below: eclipse-che/che#21613 (comment)
@tolusha I'm able to see errors in ... One more doubt: I noticed the ...
If the issue is that pulling images takes longer than 5 minutes, then the solutions would be either increasing the `progressTimeout` as described above, or making image pulls faster (e.g. by pre-pulling the images on the cluster nodes).
Note: this PR depends on #598 and so for now includes those changes as well; ignore all but the last three commits (i.e. 05c0441 to the end). Once #598 is merged, this PR will be a lot simpler.
What does this PR do?
Introduces config setting `.workspace.startProgressTimeout` that controls how long a workspace can be in the starting phase without progressing before it is automatically failed.

"Progressing" in our case means "updating workspace `.status.conditions`" -- if the last condition change (checked by `lastTransitionTime`) is more than `idleTimeout` ago, we assume the workspace has failed.

Default timeout value is 5 minutes, to allow for e.g. an incredibly slow image pull.
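As a rough sketch of the check described above (assumed names and types; the controller's actual implementation may differ, e.g. the DevWorkspace API has its own condition type):

```go
package controllers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasTimedOut illustrates the described mechanism: find the most recent
// condition transition and fail the workspace if nothing has progressed
// within the configured timeout.
func hasTimedOut(conditions []metav1.Condition, timeout time.Duration) bool {
	var lastTransition time.Time
	for _, cond := range conditions {
		if cond.LastTransitionTime.Time.After(lastTransition) {
			lastTransition = cond.LastTransitionTime.Time
		}
	}
	// No transition within the timeout window means the workspace is stuck.
	return !lastTransition.IsZero() && time.Since(lastTransition) > timeout
}
```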
What issues does this PR fix or reference?
Closes #578
Is it tested? How?

Configure the `DevWorkspaceOperatorConfig` to ignore `FailedScheduling` events (and optionally set a shorter `idleTimeout`), then start a workspace that cannot be scheduled.

PR Checklist
- E2E tests pass (comment `/test v8-devworkspace-operator-e2e, v8-che-happy-path` to trigger)
  - `v8-devworkspace-operator-e2e`: DevWorkspace e2e test
  - `v8-che-happy-path`: Happy path test for verifying integration with Che